This article introduces Streamlit, a Python library for building data dashboards, as a way for Python programmers to create graphical front-ends without needing to delve into CSS, HTML, or JavaScript. The author, a seasoned data engineer, explains how Streamlit and similar tools enable the creation of attractive dashboards, marking a shift from traditional BI tools like Tableau or Amazon QuickSight. This piece is the first in a series focusing on Streamlit, with future articles planned on Gradio and Taipy. The author aims to build similar layouts and functionality across all three tools using the same data.
These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.
| Code Snippet | Explanation |
| --- | --- |
| `df.isnull().sum()` | Counts the number of missing values per column. |
| `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
| `df.describe()` | Provides basic descriptive statistics of numerical columns. |
| `df.info()` | Displays a concise summary of the DataFrame including data types and presence of null values. |
| `df.nunique()` | Counts the number of unique values per column. |
| `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes the percentage of unique values (relative to the non-null count) for each column. |
| `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
| `df.applymap(lambda x: isinstance(x, type_to_check)).sum()` | Counts the number of values of a specific type (e.g., `int`, `str`) per column. In pandas ≥ 2.1, use `df.map` instead of the deprecated `applymap`. |
| `df.dtypes` | Lists the data type for each column in the DataFrame. |
| `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
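To see a few of these one-liners in action, here is a minimal sketch using a small hypothetical DataFrame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
# and one duplicate row, so the checks have something to find.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", None],
    "temp": [12.5, 12.5, 9.0, np.nan],
})

print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.nunique())           # distinct values per column
print(df.dtypes)              # data type of each column
```

Each line prints a per-column Series (or a scalar, for `duplicated().sum()`), which makes these checks easy to drop into a quick exploratory session or a pipeline sanity check.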
This article explains how to quickly detect data quality issues and identify their causes using Python for ETL pipelines. It discusses strategies to minimize the time required to fix data quality problems.
This article provides Python tricks and techniques for data ingestion, validation, processing, and testing in data engineering projects. It offers practical solutions for streamlining the code, including tips for data validation, handling errors, and testing.
An exploration of the benefits of switching from the popular Python library Pandas to the newer Polars for data manipulation tasks, highlighting improvements in performance, concurrency, and ease of use.
An in-process analytics database, DuckDB can work with surprisingly large data sets without having to maintain a distributed multiserver system. Best of all? You can analyze data directly from your Python app.
An article discussing a simple and free way to automate data workflows using Python and GitHub Actions, written by Shaw Talebi.
Intro to Streamlit
- Simple and complex Streamlit examples
- Data and state management in Streamlit apps
- Data widgets for Streamlit apps
- Deploying Streamlit apps